Homework 1: Some Fancy Climate Dataset Visualizations
Multi-line graphs? Choropleths? Zooming and hovering in your plots? You want it? It’s yours my friend, as long as you have plotly.
Part 1 of our class has focused on how to make fancy data visualizations, using cool tools like plotly along with some advanced data-wrangling techniques in pandas.
So towards that purpose, we’re going to be making some neat data graphics regarding the NOAA climate data that we’ve been exploring. Our main tool for this job will be plotly, which makes it pretty easy to make really cool interactive plots, even of geographic data. But we’ll get to that.
Creating the Database
First of all, as discussed in class, we want to create a database for easy querying with SQL. But instead of reading our data in directly, we still need to write a function to clean up the data so that we'll have an easier time feeding it into plotly later.
For that purpose, let’s actually read in the csv directly with pandas, but only a small subset. We’ll inspect this subset to see the data cleaning process.
import pandas as pd
At this point, I’d like to address a question that I suppose can be considered a beginner topic for a second-level Python class: how do you read in a very large file? What if the file is larger than the memory you have available? This is relevant here because, unfortunately, the file we want to read in is quite large. Not so large that it doesn't fit into memory, but large enough to cause inconvenience whenever we want to straight-up read it in.
Answering this question will be our focus in this preliminary section. Let me start by asking another, more for-fun question: I know many of you know about pandas and, of course, the famous pd.read_csv. But did you know that you can provide a kwarg called chunksize to it?
df_iter = pd.read_csv("temps.csv", chunksize = 100)
You learn new tricks about your old dogs all the time I guess in programming. After all pd.read_csv has like 500 kwargs give or take, and I only know what around 4 of those do.
Either way, here's the cool takeaway from providing this chunksize: instead of getting the straight DataFrame generated by reading in the csv, we now get an iterator, with each element being another “chunk” of the entire giant DataFrame.
df = next(df_iter)
df.head()
| | ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACW00011604 | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
| 1 | ACW00011604 | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
| 2 | ACW00011604 | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
| 3 | ACW00011604 | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
| 4 | ACW00011604 | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
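As a quick sketch of what that iterator buys us, here's the same idea on a tiny synthetic csv (io.StringIO stands in for a real file on disk; the data here is made up): we can aggregate chunk by chunk without ever holding the full table in memory.

```python
import io
import pandas as pd

# a small synthetic csv standing in for the (much larger) temps.csv
csv_text = "ID,Year,VALUE1\nA,1961,100\nA,1962,110\nB,1961,95\nB,1962,105\nC,1961,120\n"

# chunksize turns read_csv into an iterator of DataFrames
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# aggregate chunk by chunk: the full table never lives in memory at once
total_rows = 0
for chunk in chunks:
    total_rows += len(chunk)

print(total_rows)  # 5
```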
Some important points to know for later: the ID column is the ID of a certain weather station in the world that recorded its data in a particular year. You might notice that with just the info in this DataFrame, we have no way of knowing where these temperatures are coming from. However, the ID will be very useful later when we want to find where that station is.
As we’ve been doing with this temperatures dataset for the past week or so in class, we can borrow this cleaning function to make our lives a bit easier, especially for the joins later.
def prepare_df(df):
    # standard method for stacking, so we get a row for each month
    df = df.set_index(keys=["ID", "Year"])
    df = df.stack()
    df = df.reset_index()
    # the stack command gives generic names to the new columns, so
    # let's set them to something more meaningful
    df = df.rename(columns = {"level_2" : "Month" , 0 : "Temp"})
    # data for each month is in a column called "VALUE#", so we can get
    # the month number from the characters after the "VALUE" prefix
    df["Month"] = df["Month"].str[5:].astype(int)
    # temps are currently in units of hundredths of a degree Celsius,
    # so convert back to standard Celsius
    df["Temp"] = df["Temp"] / 100
    # the FIPS country code is just the first two characters of the ID
    df["FIPS"] = df["ID"].str[:2]
    return df
One important point I’d like to mention here is the addition of the FIPS column. This will be useful later because we can use this FIPS code to join with our country data.
Either way, if we look at the cleaned-up DataFrame,
df = prepare_df(df)
df.head()
| | ID | Year | Month | Temp | FIPS |
|---|---|---|---|---|---|
| 0 | ACW00011604 | 1961 | 1 | -0.89 | AC |
| 1 | ACW00011604 | 1961 | 2 | 2.36 | AC |
| 2 | ACW00011604 | 1961 | 3 | 4.72 | AC |
| 3 | ACW00011604 | 1961 | 4 | 7.73 | AC |
| 4 | ACW00011604 | 1961 | 5 | 11.28 | AC |
That looks great! Now that we have a nice working cleaning function, our next task is to read our data into an sqlite3 database file, chunk by chunk.
First, we have to create this database file.
import sqlite3
conn = sqlite3.connect("temps.db") # create a database in current directory called temps.db
Now this is where we actually read our csv in smaller, bite-sized chunks: the same chunksize argument from before gives us an iterator yielding pieces of the DataFrame one at a time, which is perfect for appending into our database.
df_iter = pd.read_csv("temps.csv", chunksize = 100000)
for df in df_iter:
    df = prepare_df(df)
    df.to_sql("temperatures", conn, if_exists = "append", index = False)
Next, we bring in the stations metadata, which gives us the precise location of each station given its ID. Joining on this column is exactly how we’re going to get locational data for each of our observations.
stations = pd.read_csv("https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/noaa-ghcn/station-metadata.csv")
stations.to_sql("stations", conn, if_exists = "replace", index = False)
Finally, we bring in a file that tells us which country each station is in. Recall that we created a FIPS column for each temperature observation. To decode which country each observation was taken in, we can bring in this table of country data, with each country's FIPS code as well as its human-readable name.
countries_url = "https://raw.githubusercontent.com/mysociety/gaze/master/data/fips-10-4-to-iso-country-codes.csv"
countries = pd.read_csv(countries_url)
countries = countries.rename({"Name": "Country Name"}, axis=1)
countries.to_sql("countries", conn, if_exists = "replace", index = False)
conn.close()
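Before closing the connection, it's worth a quick sanity check that the tables actually made it into the database. Here's a minimal sketch of the idea using an in-memory database and tiny hypothetical tables (the real temps.db would list temperatures, stations, and countries):

```python
import sqlite3
import pandas as pd

# an in-memory database standing in for temps.db, with toy versions of two tables
conn = sqlite3.connect(":memory:")
pd.DataFrame({"ID": ["ACW00011604"], "Temp": [1.13]}).to_sql("temperatures", conn, index=False)
pd.DataFrame({"ID": ["ACW00011604"], "NAME": ["SAVE"]}).to_sql("stations", conn, index=False)

# sqlite_master lists every object in the database; filter for tables
cursor = conn.cursor()
tables = [row[0] for row in cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(sorted(tables))  # ['stations', 'temperatures']
conn.close()
```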
I gave one of my peers the suggestion to talk more about how each of these tables relates to the others, so that the SQL commands are a bit less mysterious to someone not as familiar with this dataset. In short: temperatures holds the monthly readings keyed by station ID, stations maps each ID to a name and coordinates, and countries maps the two-letter FIPS prefix of each ID to a human-readable country name.
Querying the Database
On its own, “querying the database” sounds rather easy even if you know just the most basic SQL. However, there are a couple of slightly complicating factors to consider:
- Remember that the information we want is split across three tables in the database (temperatures, stations, and countries). That means we actually have to perform joins. As long as we know the basic idea of a join, this won’t be too bad.
- We’ll be wanting to query our database a couple times, and asking about possibly different subsets of data (like different countries).
This sounds like a good time to make a function.
def query_climate_database(year_begin, year_end, month, country=None, min_elev=None):
    conn = sqlite3.connect("temps.db") # connect to the temps.db we created
    cursor = conn.cursor()
    cmd = \
    f"""
    SELECT S.id, S.name, S.LATITUDE, S.LONGITUDE, S.STNELEV, C.`Country Name`, T.year, T.month, T.temp
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON T.FIPS = C.`FIPS 10-4`
    WHERE T.year BETWEEN {year_begin} AND {year_end} AND T.month = {month}
    """
    if min_elev is not None:
        # keep only stations at or above the given elevation
        cmd += f" AND S.STNELEV >= {min_elev}"
    if country is not None:
        # look up the country's FIPS code, then match station IDs by prefix
        cursor.execute(f"SELECT `FIPS 10-4` FROM countries WHERE `Country Name` = '{country}'")
        (country_code,) = cursor.fetchone()
        cmd += f" AND S.id LIKE '{country_code}%'"
    df = pd.read_sql_query(cmd, conn)
    conn.close()
    return df
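One aside: the f-string interpolation above works, but it's worth knowing that sqlite3 also supports `?` placeholder substitution, which sidesteps quoting issues (and SQL injection) for values like the country name. Here's a sketch of the same country-code lookup against a toy in-memory table (the table contents are made up for illustration):

```python
import sqlite3

# toy in-memory stand-in for the countries table
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE countries (`FIPS 10-4` TEXT, `Country Name` TEXT)")
cursor.execute("INSERT INTO countries VALUES ('IN', 'India')")

# the ? placeholder lets sqlite3 handle quoting/escaping of the value
cursor.execute("SELECT `FIPS 10-4` FROM countries WHERE `Country Name` = ?", ("India",))
(country_code,) = cursor.fetchone()
print(country_code)  # IN
conn.close()
```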
This function underwent pretty significant changes thanks to feedback, if not in overall look then definitely in idea. First and foremost, I changed the SQL query to also join on the countries table, because that's basically what I was doing in a pandas step later in the function.
Previously, I had some terribly inefficient second part of the code like this:
df["FIPS 10-4"] = df["ID"].str[0:2]
countries = pd.read_sql_query("SELECT `FIPS 10-4`, `Country Name` from countries", conn)
df = pd.merge(df, countries, on = ["FIPS 10-4"])
but as you might have seen in the earlier section, parsing out that FIPS column is taken care of in the prepare_df step now, which I think makes a lot more sense.
I also learned that you can actually have two left joins in one SQL query, so we don’t have to perform that second merge. Making it all one SQL query is not only more efficient, but much easier to read as well.
I also helped one of my peers out by providing an example of how to use .format() with keyword arguments, which makes SQL queries a lot more readable than positional arguments that aren't named anything informative. In fact, that reminded me that I could even use f-strings in my own code, which also reads a bit more easily.
Cool, let’s try the function out.
df = query_climate_database(
    year_begin = 0,
    year_end = 2020,
    month = 1)
df.head()
| | ID | NAME | LATITUDE | LONGITUDE | STNELEV | Country Name | Year | Month | Temp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ACW00011604 | SAVE | 57.7667 | 11.8667 | 18.0 | Antigua and Barbuda | 1961 | 1 | -0.89 |
| 1 | ACW00011604 | SAVE | 57.7667 | 11.8667 | 18.0 | Antigua and Barbuda | 1962 | 1 | 1.13 |
| 2 | ACW00011604 | SAVE | 57.7667 | 11.8667 | 18.0 | Antigua and Barbuda | 1963 | 1 | -7.13 |
| 3 | ACW00011604 | SAVE | 57.7667 | 11.8667 | 18.0 | Antigua and Barbuda | 1964 | 1 | 0.62 |
| 4 | ACW00011604 | SAVE | 57.7667 | 11.8667 | 18.0 | Antigua and Barbuda | 1965 | 1 | 0.44 |
Plot 1: Geographic Scatter Plot for Yearly Temperature Increases
Now that we have a good way to query the database, we can finally start making some neat plots. In order to do that, we’re going to be relying on plotly, a really neat library that’ll let us make plots really easily that not only look good, but are even interactive.
from plotly import express as px
Now our first plot will aim to answer the question:
How does the average yearly change in temperature vary within a given country?
To answer this, we'll first need some way to measure average yearly change. We essentially need to fit a slope showing how much temperature rises (or falls) over the “run” of time. What would work perfectly here is a linear regression! Let's remember how to do that from PIC 16A.
from sklearn.linear_model import LinearRegression
def coef(data_group):
    x = data_group[["Year"]] # 2 brackets because X should be a df
    y = data_group["Temp"]   # 1 bracket because y should be a series
    LR = LinearRegression()
    LR.fit(x, y)
    return round(LR.coef_[0], 2)
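As a quick sanity check of what coef should report, here's the same slope computed with numpy's polyfit on a toy station record with a built-in 0.05 °C/year trend (polyfit is an independent estimator of the same least-squares line, used here only to check the idea; the numbers are made up):

```python
import numpy as np

# toy station record with a perfectly linear 0.05 °C/year warming trend
years = np.array([2000, 2001, 2002, 2003])
temps = 10.0 + 0.05 * (years - 2000)

# degree-1 polyfit fits the same least-squares line as LinearRegression
slope = np.polyfit(years, temps, 1)[0]
print(round(slope, 2))  # 0.05
```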
Now we’ll use this fancy new function that summarizes the yearly data from a particular station along with the query function in order to make a final process that generates the figure based on user-provided inputs.
Something important to note here: this linear regression won't be very informative if we only have, say, two years of data recorded at some station in this range. In fact, it's likely to be misinformative: we might get an abnormally large change across those two years, suggesting an unusual trend compared to what we'd see with more years to smooth things out.
To alleviate this potential missing-data flaw, we also provide a min_obs argument, which keeps only stations with at least that many observations in the chunk of data we pull.
def temperature_coefficient_plot(year_begin, year_end, month, min_obs, country=None, **kwargs):
    """
    Plots all stations in a user-specified country in a scatter fashion, with
    the color of each point controlled by the linear regression slope calculated
    by the coef function, estimating how the temperature in the specified month
    changed over the years.
    The min_obs argument serves to filter out stations without enough data for a
    reliable linear regression.
    User-specified kwargs are passed on to the plotly call.
    """
    # retrieve the piece of data we're interested in
    # using the previously built query function
    df = query_climate_database(year_begin, year_end, month, country=country)
    # the number of observations for a particular station is the number of
    # rows in each group when grouping by station name
    df["Num Obs"] = df.groupby("NAME")["Year"].transform(len)
    # then we can just use boolean indexing to filter
    # out rows without enough observations
    df = df[df["Num Obs"] >= min_obs]
    # we don't need to group by lat/lon per se, but doing so carries those
    # coordinates into the coefs df, which we need to feed into plotly
    coefs = df.groupby(["NAME", "LATITUDE", "LONGITUDE"]).apply(coef)
    coefs = coefs.reset_index()
    coefs = coefs.rename(columns = {0: "Estimated Yearly Increase (°C)"})
    fig = px.scatter_mapbox(coefs,
                            lat="LATITUDE",
                            lon="LONGITUDE",
                            hover_name="NAME",
                            color = "Estimated Yearly Increase (°C)",
                            **kwargs)
    return fig
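That groupby trick, grouping by coordinates that are constant within each station just to carry them along into the summary, is worth seeing in isolation. A toy sketch (station names and coordinates made up):

```python
import pandas as pd

# toy observations: each station's coordinates are repeated on every row
toy = pd.DataFrame({
    "NAME": ["SAVE", "SAVE", "OTHER", "OTHER"],
    "LATITUDE": [57.77, 57.77, 10.0, 10.0],
    "LONGITUDE": [11.87, 11.87, 20.0, 20.0],
    "Temp": [1.0, 3.0, 5.0, 7.0],
})

# grouping by the coordinates too doesn't change the groups (they're constant
# per station), but reset_index carries them into the summary as columns
out = toy.groupby(["NAME", "LATITUDE", "LONGITUDE"])["Temp"].mean().reset_index()
print(list(out.columns))  # ['NAME', 'LATITUDE', 'LONGITUDE', 'Temp']
```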
I also made sure to add way more comments to this function, as well as other functions.
Cool. Now let’s try out the function.
color_map = px.colors.diverging.RdGy_r # choose a colormap
fig = temperature_coefficient_plot(1980, 2020, 1,
                                   country="India",
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map,
                                   # plotly titles use <br> (not \n) for line breaks
                                   title="Estimates for yearly increase in temperature in January<br>for stations in India, years 1980-2020")
fig.show()
Plot 2
Which countries have increased their measurements most over a period of time?
I am especially interested in asking this question from not so much a scientific as a historical viewpoint. For example, during the Cold War tensions of the '60s, can we see governments' heightened interest in science reflected in the amount of meteorological data collected? Or, even more narrowly, can we see the economic or political status of certain countries reflected in this difference as well?
Either way, to answer these questions, we need to get a chunk of data from our database. For this, we don't really care about the elevation of these stations, just the total number. Since we're also trying to see the difference across many different countries at once (the plot would be very boring if we could only do one country at a time; it'd essentially be a single number), we'd better get data from all countries. Let's choose 1980 to 1990, a period in history when I believe a lot of interesting events were happening.
df = query_climate_database(
    year_begin = 1980,
    year_end = 1990,
    month = 1)
Now, this is one limitation of our query function, I suppose: we're not actually interested in the records for every year between 1980 and 1990, only in the difference. That means we only need data at the endpoints.
df = df.loc[(df["Year"] == 1980) | (df["Year"] == 1990)]
This is where things get a little hairy. I first need to count the number of unique IDs that gave us records in a certain country, on a per-year basis. (The number of unique IDs should be the same as the number of active stations that year.)
But that description is literally just a groupby operation; I just need the nunique function to count the number of unique values in a pandas Series.
station_num_df = df.groupby(by=["Country Name", "Year"], as_index=False).agg({'ID': pd.Series.nunique})
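Here's that groupby/nunique pattern on a toy table, just to make the shape of the output concrete (the station IDs here are made up):

```python
import pandas as pd

# toy records: which station IDs reported, per country, per year
toy = pd.DataFrame({
    "Country Name": ["Albania", "Albania", "Albania", "Albania"],
    "Year": [1980, 1980, 1990, 1990],
    "ID": ["AL000001", "AL000001", "AL000001", "AL000002"],
})

# nunique collapses repeated reports from the same station into one count
counts = toy.groupby(["Country Name", "Year"], as_index=False).agg({"ID": pd.Series.nunique})
print(counts["ID"].tolist())  # [1, 2]
```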
But now, I need to, for each country, find the difference between its 1980 and 1990 number of stations.
However, I did run into an issue. This is pretty standard with country data like this: I noticed the US was a huge outlier, probably because it was the most economically developed nation at the time and had a lot of stations being built. This made it very hard to distinguish colors in the rest of the plot.
To perform some kind of normalization, I decided not to just take the difference, but to calculate the percent change as something to the tune of

$$\text{Growth Factor} = \frac{N_{1990} - N_{1980}}{N_{1980}},$$

where $N_y$ is the number of active stations in year $y$, which would pretty much tell me the factor by which the number of stations grew in this period.
import numpy as np
def calc_diff(data_group):
    if len(data_group) != 2:
        return np.nan
    return (data_group.iloc[1]["ID"] - data_group.iloc[0]["ID"]) / data_group.iloc[0]["ID"]
station_num_diff = station_num_df.groupby("Country Name").apply(calc_diff).rename("Growth Factor").to_frame()
station_num_diff['Country Name'] = station_num_diff.index
station_num_diff.head()
| | Growth Factor | Country Name |
|---|---|---|
| Afghanistan | -0.400000 | Afghanistan |
| Albania | 0.750000 | Albania |
| Algeria | -0.230769 | Algeria |
| American Samoa | 0.000000 | American Samoa |
| Angola | NaN | Angola |
(Side note: I needed to catch the case where len(data_group) != 2 because it's possible that a country somehow has no measurements in either 1980 or 1990. In that case, I don't want to make any assumptions about what the difference should be, so I just let the difference be nan.)
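To make both branches concrete, here's calc_diff (repeated so the snippet stands alone) run on two toy groups, one with both endpoint years present and one without (the counts are made up; the first case mirrors Albania's 0.75 above):

```python
import numpy as np
import pandas as pd

def calc_diff(data_group):
    # countries missing either endpoint year get nan rather than a guess
    if len(data_group) != 2:
        return np.nan
    return (data_group.iloc[1]["ID"] - data_group.iloc[0]["ID"]) / data_group.iloc[0]["ID"]

# both endpoint years present: 4 stations in 1980, 7 in 1990 -> growth factor 0.75
both_years = pd.DataFrame({"Year": [1980, 1990], "ID": [4, 7]})
print(calc_diff(both_years))  # 0.75

# only one endpoint year present -> nan
one_year = pd.DataFrame({"Year": [1990], "ID": [3]})
print(calc_diff(one_year))  # nan
```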
grew = station_num_diff.loc[station_num_diff['Growth Factor'] >= 0]
shrank = station_num_diff.loc[station_num_diff['Growth Factor'] < 0]
from urllib.request import urlopen
import json
countries_gj_url = "https://raw.githubusercontent.com/PhilChodrow/PIC16B/master/datasets/countries.geojson"
with urlopen(countries_gj_url) as response:
countries_gj = json.load(response)
fig = px.choropleth(grew,
                    geojson=countries_gj,
                    locations = "Country Name",
                    locationmode = "country names",
                    color = "Growth Factor",
                    height = 300,
                    title="Factor by which the number of weather stations grew between 1980 and 1990")
fig.show()
As you might expect, it looks like the West had a good amount of growth during this period, but a very interesting hotspot is actually in the Middle East. In fact, it looks like Oman is a bit of an outlier, ending up in 1990 with 3.5 times as many station observations as in 1980. I'm not going to pretend I know the history of the Middle East in much detail, but I know for sure that the biggest event then was the Iran-Iraq War, which spanned most of the decade. You would think that would cause a decrease in this kind of scientific data collection, but who knows? Maybe the war actually spurred on the advancement of climate research even more, for other reasons I'm not sure of.
Either way, let’s now look at the countries that had a decrease in number of station observations over this period.
fig = px.choropleth(shrank,
                    geojson=countries_gj,
                    locations = "Country Name",
                    locationmode = "country names",
                    color = "Growth Factor",
                    height = 300,
                    title="Factor by which the number of weather stations shrank between 1980 and 1990")
fig.show()
Neat. We can see that there are fairly consistent “areas of shrinkage”, in that many of the countries with a decrease in the number of station recordings are clustered together. For example, one seemingly large region is in South America, and also in Mexico. I bet this is due to the so-called Lost Decade economic crisis during that period; many stations probably lost funding.
I also find it really interesting that China is actually the country that had the biggest decrease in number of stations. I suspect this is another observation we can contextualize in some of the history; during that post-Mao period under Deng, there was so much focus on political reform that studying climate and maybe other kinds of research fell to the wayside.
Now let’s put this all into one big nice function.
def plot_station_diff(year_begin, year_end, **kwargs):
    """
    Creates two plots: one highlighting countries where the number of stations reporting
    temperature data grew, and one highlighting those where the number shrank, between a
    user-specified start and end year.
    User-specified kwargs are passed on to the plotly calls.
    """
    df = query_climate_database(
        year_begin = year_begin,
        year_end = year_end,
        month = 1)
    df = df.loc[(df["Year"] == year_begin) | (df["Year"] == year_end)]
    station_num_df = df.groupby(by=["Country Name", "Year"], as_index=False).agg({'ID': pd.Series.nunique})
    station_num_diff = station_num_df.groupby("Country Name").apply(calc_diff).rename("Growth Factor").to_frame()
    station_num_diff['Country Name'] = station_num_diff.index
    grew = station_num_diff.loc[station_num_diff['Growth Factor'] >= 0]
    shrank = station_num_diff.loc[station_num_diff['Growth Factor'] < 0]
    grew_fig = px.choropleth(grew,
                             geojson=countries_gj,
                             locations = "Country Name",
                             locationmode = "country names",
                             color = "Growth Factor",
                             height = 300,
                             title=f"Factor by which the number of weather stations grew between {year_begin} and {year_end}",
                             **kwargs)
    shrank_fig = px.choropleth(shrank,
                               geojson=countries_gj,
                               locations = "Country Name",
                               locationmode = "country names",
                               color = "Growth Factor",
                               height = 300,
                               title=f"Factor by which the number of weather stations shrank between {year_begin} and {year_end}",
                               **kwargs)
    return grew_fig, shrank_fig
grew_fig, shrank_fig = plot_station_diff(1980, 1990)
grew_fig.show()
shrank_fig.show()
Note: I should add the disclaimer that due to the nature of this dataset and the way I'm calculating this growth factor, the numbers might not accurately reflect the actual number of stations in each country. Recall that I'm just counting the number of stations that reported numbers in a particular year. It could very well be that a station simply didn't contribute to this dataset for one year but was still just as active. I figure these cases don't happen often enough to take away from the overall message of the plots.
A peer’s comment reminded me it’s good to comment on assumptions/limitations made with my data processing and plotting methods.
Plot 3
Either way though, we should probably do something with the actual climate data that we have. This is the next question I have:
How is temperature variance related to latitudinal position (how far north and south)?
While it might not be very applicable in a research sense, this is something I'm actually quite curious to see in a neat data visualization. We know that seasonal variance is higher the farther you get from the equator, thanks to the tilt of the Earth, but it's less clear to me what the variance will look like year over year for the same month. My instinct is that locations farther from the equator will have higher variance, though I'm not even sure why. This plot will definitely help make the difference clear.
Either way, here’s the function that will generate our plots. The comments explain every step of the process.
def latitude_diff(country, year_begin, year_end, month, **kwargs):
    """
    Plots line graphs of the percentage change in temperature over time
    for the northernmost and southernmost stations in the given country.
    """
    # use our handy query function to retrieve data in the time period and country
    # of interest
    df = query_climate_database(country=country,
                                year_begin = year_begin,
                                year_end = year_end,
                                month = month)
    # having duplicates will throw off the pct change, so throw those away;
    # the subset argument lets us find entries that share the same
    # station ID and year
    df = df.drop_duplicates(subset=['ID', 'Year'])
    # To more clearly see the difference that latitude can make, we find the stations with
    # the highest and lowest latitude. This is accomplished by using `.max`/`.min` on the
    # LATITUDE series, then boolean indexing to get the data for those locations.
    highest = df.loc[df['LATITUDE'] == df['LATITUDE'].max()]
    lowest = df.loc[df['LATITUDE'] == df['LATITUDE'].min()]
    # concatenate northmost and southmost observations so we only have to run pct_change
    # once on the whole DataFrame (pd.concat replaces the deprecated DataFrame.append)
    combined = pd.concat([highest, lowest])
    # You might worry that we're mixing the data, which would throw off `pct_change`,
    # but grouping by station name means we only compute the `pct_change` within
    # each station, which is exactly what we want.
    combined["pct_change"] = combined.groupby("NAME")["Temp"].transform(pd.Series.pct_change)
    fig = px.line(combined,
                  x="Year",
                  y="pct_change",
                  color='NAME',
                  **kwargs)
    return fig
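That last point, that grouping keeps pct_change from bleeding across stations, is easy to verify on a toy frame with two made-up stations:

```python
import pandas as pd

# two toy stations; pct_change must not leak across the station boundary
toy = pd.DataFrame({
    "NAME": ["NORTH", "NORTH", "SOUTH", "SOUTH"],
    "Temp": [10.0, 11.0, 20.0, 22.0],
})

# the first row of each group is NaN (no prior value within that group);
# the rest are computed only against rows from the same station
toy["pct_change"] = toy.groupby("NAME")["Temp"].transform(pd.Series.pct_change)
print(toy["pct_change"].tolist())  # first value in each group is NaN, the rest ~0.1
```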
Here’s another example of adding more docstrings. More importantly, though, I also condensed the explanation blocks into the function itself, mostly because there was no real intermediate outputs being displayed during the step-by-step explanation, which kind of defeated the purpose. So now I’m just relying on the comments to do the explaining.
fig = latitude_diff(country="Australia",
                    year_begin=1960,
                    year_end=2020,
                    month=6,
                    title='June temperature variance for the northmost and southmost stations in Australia between 1960 and 2020')
fig.show()
We once again have a bit of a missing data problem, but we also get another interesting result! I was really confused for about 15 seconds about why we get the opposite result, with the more southern station being more variable, until I remembered that Australia is in the Southern Hemisphere, so south actually means closer to the pole!